This R markdown document describes the methodology and results of a portion of the data analysis we conducted in support of a reporting project examining the effects of tree canopy inequity across the city of Baltimore, especially as it relates to climate change.
In general, this document is arranged into analyses of the following categories, though there are some cases where one statistic depends on multiple categories (e.g. the canopy cover in a hot neighborhood):
Before running this file, please view and run the Code Red Data Cleaning document for this project. As well as outputting necessary cleaned data for the following ananlysis, that document also includes the following items necessary to understand this analysis:
#######################
#### Load Packages ####
#######################
library(tidyverse)
library(DescTools) # For %like% operator
library(corrr) # For correlation matrices
library(colorspace) # For improved color palettes
library(ggplot2) # For graphing
library(ggrepel) # For graph labeling
require(scales) # For percent labeling on distribution tables
#library(here) # For cleaner file path writing
# Turn off scientific notation in RStudio (prevents coersion to character type)
options(scipen = 999)
#########################
#### Store Variables ####
#########################
#### Common path to data ####
path_to_data <- "../data/output-data/cleaned/"
#### NSAs of interest ####
target_nsas <- c("Berea", "Broadway East", "Oliver", "Middle East",
"Biddle Street","Milton-Montford", "Madison-Eastend",
"CARE", "McElderry Park", "Ellwood Park/Monument",
"Patterson Place", "Patterson Park Neighborhood",
"Baltimore Highlands", "Highlandtown",
"Upper Fells Point") %>%
lapply(tolower)
counterpoint_nsas <- c("Butcher's Hill", "Canton", "Washington Hill", "Roland Park") %>%
lapply(tolower)
#### CSAs of interest ####
target_csas <- c("Greater Roland Park/Poplar Hill", "Canton", "Patterson Park North & East", "Greenmount East", "Clifton-Berea") %>%
lapply(tolower)
###################
#### Load Data ####
###################
blocks_tree_temp_demographics <-
read_csv(paste0(path_to_data, "blocks_tree_temp_demographics.csv")) %>%
mutate_at(vars(matches("geoid10")), as.character) # Recast non-calculable variables as characters
csa_tree_temp_demographics <-
read_csv(paste0(path_to_data, "csa_tree_temp_demographics.csv"))
nsa_tree_temp <-
read_csv(paste0(path_to_data, "nsa_tree_temp.csv"))
zcta_tree_temp_demographics <-
read_csv(paste0(path_to_data, "zcta_tree_temp_demographics.csv")) %>%
mutate_at(vars(matches("zcta")), as.character) # Recast non-calculable variables as characters
redlining_tree <- read_csv(paste0(path_to_data, "redlining_tree.csv"))
street_trees_nsa_categorized <-
read_csv(paste0(path_to_data, "street_trees_nsa_categorized.csv"))
street_trees_nsa_summarized <-
read_csv(paste0(path_to_data, "street_trees_nsa_summarized.csv"))
The following arranges and ranks blocks across the city by temperature in the afternoon of August 29, 2018, as explained in the Code Red Data Cleaning document.
blocks_tree_temp_demographics %>%
select(geoid10, temp_mean_aft) %>%
mutate(rank = rank(-temp_mean_aft)) %>%
arrange(rank)
Below, we see the block on N. Milton Avenue between Oliver and Federal is one of the city’s hottest, ranking at 236 out of 13,598 blocks. The block GEOID for this block was pulled from QGIS and breaks down into the following codes:
blocks_tree_temp_demographics %>%
select(geoid10, temp_mean_aft) %>%
mutate(rank = rank(-temp_mean_aft)) %>%
filter(geoid10 == "245100803011000")
The following arranges and ranks NSAs across the city by temperature in the afternoon of August 29, 2018.
First, we can see that all of the top 10 hottest NSAs are located in the south of the city, and many are in the south-east.
# Top 10 hottest neighborhoods
nsa_tree_temp %>%
select(nsa_name, temp_mean_aft) %>%
mutate(rank = rank(-temp_mean_aft)) %>%
arrange(rank)
Looking at the 10 coolest NSAs, we see it is generally true that they are located far to the west and north, on the outskirts of Baltimore.
# Top 10 coolest neighborhoods
nsa_tree_temp %>%
select(nsa_name, temp_mean_aft) %>%
mutate(rank = rank(-temp_mean_aft)) %>%
arrange(desc(rank))
Below we see:
nsa_tree_temp %>%
select(nsa_name, temp_mean_aft) %>%
filter((temp_mean_aft == min(temp_mean_aft)) | (temp_mean_aft == max(temp_mean_aft))) %>%
arrange(desc(temp_mean_aft))
The difference in temperatures between the city’s hottest and coolest neighborhoods is -8.65 degrees Fahrenheit.
Below, we see the relative ranks of the Broadway East and Roland Park neighborhoods, placing Roland Park at 263 out of 278 neighborhoods, while Roland Park is ranked at 16.
nsa_tree_temp %>%
select(nsa_name, temp_mean_aft) %>%
mutate(rank = rank(-temp_mean_aft)) %>%
arrange(rank) %>%
filter((nsa_name %like% "roland park") | nsa_name %like% ("broadway east"))
These data are visualized in the following choropleth map, which was exported from QGIS and prettified slightly in Illustrator:
For the demographic anaylsis, we used CSA rather than NSA geographic segments, for reasons explained in the Code Red Data Cleaning document. The CSAs and NSAs do not allign completely. The demographic information is primarily from the Baltimore Neighborhood Indicators Alliance. Further explanations about the source data are in the Code Red Data Cleaning document.
First, the top 10 CSAs ranked by percent of families living below the poverty line:
csa_tree_temp_demographics %>%
select(csa2010,
avg_household_income = median_household_income,
perc_below_poverty = percent_of_family_households_living_below_the_poverty_line) %>%
mutate(rank = rank(-perc_below_poverty)) %>%
arrange(rank)
Next, we see only the CSAs containing the NSAs mentioned in the story, ranked by percent of families living below the poverty line (out of 55 ranks).
csa_tree_temp_demographics %>%
mutate(rank = rank(-percent_of_family_households_living_below_the_poverty_line)) %>%
arrange(rank) %>%
mutate(associated_nsa = case_when(
(csa2010 %like% "%clifton%") | (csa2010 %like% "%greenmount%") ~ "broadway east",
(csa2010 %like% "%madison%") | (csa2010 %like% "%park north%") ~ "mcelderry park",
(csa2010 %like% "%poplar%") ~ "roland park",
T ~ NA_character_
)) %>%
select(csa2010, associated_nsa,
avg_household_income = median_household_income,
perc_below_poverty = percent_of_family_households_living_below_the_poverty_line,
rank) %>%
filter((csa2010 %like% "%clifton%") |
(csa2010 %like% "%greenmount%") |
(csa2010 %like% "%madison%") |
(csa2010 %like% "%park north%") |
(csa2010 %like% "%poplar%"))
Broadway East, as a combination of Clifton-Berea and Greenmount East, has a total percent of family households living below the poverty line of 25.93 according to:
(csa_tree_temp_demographics$percent_of_family_households_living_below_the_poverty_line[csa_tree_temp_demographics$csa2010 %like% "%greenmount%"] +
csa_tree_temp_demographics$percent_of_family_households_living_below_the_poverty_line[csa_tree_temp_demographics$csa2010 %like% "%clifton%"]) / 2
## [1] 25.92508
The following tree canopy data is based on LIDAR data from 2015, as explained in the Code Red Data Cleaning document.
A correlation matrix of poverty and canopy cover shows a weak negative correlation of -.34. In other words, places with a high poverty rate will have fewer trees, in general.
#### Build correlation matrix ####
csa_tree_temp_demographics %>%
select(perc_below_poverty = percent_of_family_households_living_below_the_poverty_line,
avg_canopy_2015 = `15_lid_mean`) %>%
as.matrix() %>%
correlate() %>%
mutate(variable=rowname) %>%
select(variable, everything(), -rowname)
Here are the data related to poverty, income and canopy at the top and bottom five CSAs when ranked for poverty:
csa_tree_temp_demographics %>%
select(csa2010,
avg_household_income = median_household_income,
perc_below_poverty = percent_of_family_households_living_below_the_poverty_line,
avg_canopy_2015 = `15_lid_mean`) %>%
mutate(rank_poverty = rank(-perc_below_poverty),
rank_canopy = rank(-avg_canopy_2015)) %>%
filter(between(rank_poverty, 1L, 5L) |
between(rank_poverty, 51L, 55L)) %>%
arrange(rank_poverty)
Looking again at only the NSAs of interest in the story:
csa_tree_temp_demographics %>%
mutate(rank_perc_poverty = rank(-percent_of_family_households_living_below_the_poverty_line),
rank_perc_canopy = rank(-`15_lid_mean`)) %>%
arrange(rank_perc_canopy) %>%
mutate(associated_nsa = case_when(
(csa2010 %like% "%clifton%") | (csa2010 %like% "%greenmount%") ~ "broadway east",
(csa2010 %like% "%madison%") | (csa2010 %like% "%park north%") ~ "mcelderry park",
(csa2010 %like% "%poplar%") ~ "roland park",
T ~ NA_character_
)) %>%
select(csa2010, associated_nsa,
avg_household_income = median_household_income,
perc_below_poverty = percent_of_family_households_living_below_the_poverty_line,
perc_canopy = `15_lid_mean`,
rank_perc_poverty,
rank_perc_canopy) %>%
filter((csa2010 %like% "%clifton%") |
(csa2010 %like% "%greenmount%") |
(csa2010 %like% "%madison%") |
(csa2010 %like% "%park north%") |
(csa2010 %like% "%poplar%"))
If we build a correlation matrix only looking at certain neighborhoods of interest, the pattern is even more pronounced:
#### Build correlation matrix ####
csa_tree_temp_demographics %>%
filter(csa2010 %in% target_csas) %>%
select(perc_below_poverty = percent_of_family_households_living_below_the_poverty_line,
avg_canopy_2015 = `15_lid_mean`) %>%
as.matrix() %>%
correlate() %>%
mutate(variable=rowname) %>%
select(variable, everything(), -rowname)
Below is the trend viewed graphically:
# CSAs to call out
callout_ls <- c("Canton", "Clifton-Berea", "Greater Roland Park/Poplar Hill", "Greenmount East")
## POVERTY TO TREE COVER
csa_tree_temp_demographics %>%
mutate_at(vars("csa2010"), str_to_title) %>%
# Start ggplot and set x and y for entire plot
ggplot(aes(
x = percent_of_family_households_living_below_the_poverty_line/100,
y = `07_lid_mean`
)) +
# This section for the basic scatterplot
geom_point(aes(color = `07_lid_mean`),
size=4) +
# This section for circling all sample neighborhood points
geom_point(data = csa_tree_temp_demographics %>%
mutate_at(vars("csa2010"), str_to_title) %>%
filter((csa2010 %in% callout_ls)
# Patterson Park must be included seperately because of its unique label positioning
| (csa2010 == "Patterson Park North & East")
),
aes(color = `07_lid_mean`),
size=6, shape = 1) +
# This section shows the trend line
geom_smooth(se = FALSE, # Removes gray banding
method = glm,
color = "black") +
# This section for labeling Canton, etc.
ggrepel::geom_label_repel(data = csa_tree_temp_demographics %>%
mutate_at(vars("csa2010"), str_to_title) %>%
filter(csa2010 %in% callout_ls) %>%
mutate(csa2010 = case_when(
csa2010 == "Greenmount East" ~ "Greenmount East \n(includes part of Broadway East)",
csa2010 == "Clifton-Berea" ~ "Clifton-Berea \n(includes part of Broadway East)",
T ~ csa2010)),
aes(label = csa2010),
min.segment.length = .1,
segment.alpha = .5,
alpha = .85,
nudge_x = .05,
nudge_y = .06) +
# This section for labeling Patterson Park (so its label can be nudged)
ggrepel::geom_label_repel(data = csa_tree_temp_demographics %>%
mutate_at(vars("csa2010"), str_to_title) %>%
filter(csa2010 == "Patterson Park North & East") %>%
mutate(csa2010 = case_when(
csa2010 == "Patterson Park North & East" ~ "Patterson Park North & East \n(includes most of McElderry Park)",
T ~ csa2010)),
aes(label = csa2010),
min.segment.length = .1,
segment.alpha = .5,
alpha = .85,
nudge_x = -.06,
nudge_y = .03) +
# Colors and label formatting follow
#coord_flip() +
scale_colour_gradient(low = "#E0FEA9", high = "#144A11") +
labs(title = "Poverty to Tree Canopy",
subtitle = "Percent of households living below the poverty line \ncompared to the percent of tree cover in the area",
x = "Percent of households living below the poverty line",
y = "Percent of land covered by trees") +
scale_x_continuous(label = scales::percent_format(accuracy = 1.0),
breaks = seq(0, 1, .1)) +
scale_y_continuous(label = scales::percent_format(accuracy = 1.0),
breaks = seq(0, 1, .1)) +
theme_bw() +
theme(legend.position = "none",
plot.title = element_text(size = 20),
plot.subtitle = element_text(size = 12))
There are some exceptions to this trend, such as Penn North/Reservoir Hill and Greater Rosemont, which both have relatively high rates of both poverty and tree canopy:
csa_tree_temp_demographics %>%
select(csa2010,
perc_below_poverty = percent_of_family_households_living_below_the_poverty_line,
avg_canopy_2015 = `15_lid_mean`) %>%
mutate(rank_perc_poverty = rank(-perc_below_poverty),
rank_perc_canopy = rank(-avg_canopy_2015)) %>%
arrange(rank_perc_canopy) %>%
filter((csa2010 %like% "%penn north%") | (csa2010 %like% "%rosemont%"))
Average canopy cover by NSA is visualized in the following choropleth map, which was exported from QGIS and prettified slightly in Illustrator:
When comparing temperature to tree canopy, we see a strong correllation of -.89. In other words, places with high temperatures tend to have fewer trees.
#### Build correlation matrix ####
csa_tree_temp_demographics %>%
select(temp_mean_aft,
avg_canopy_2015 = `15_lid_mean`) %>%
as.matrix() %>%
correlate() %>%
mutate(variable=rowname) %>%
select(variable, everything(), -rowname)
We can also view the data ranked by temperature:
csa_tree_temp_demographics %>%
select(csa2010,
temp_mean_aft,
avg_canopy_2015 = `15_lid_mean`) %>%
mutate(rank_temp = rank(-temp_mean_aft),
rank_canopy = rank(-avg_canopy_2015)) %>%
arrange(temp_mean_aft)
The coolest CSA has 11 times as much canopy cover as the hottest:
(csa_tree_temp_demographics$`15_lid_mean`[(csa_tree_temp_demographics$temp_mean_aft == min(csa_tree_temp_demographics$temp_mean_aft))]) /
(csa_tree_temp_demographics$`15_lid_mean`[(csa_tree_temp_demographics$temp_mean_aft == max(csa_tree_temp_demographics$temp_mean_aft))])
## [1] 11.09265
This order of magnitude jump holds true for the second-hottest CSA, but ends at the third.
csa_tree_temp_demographics %>%
select(csa2010,
temp_mean_aft,
avg_canopy_2015 = `15_lid_mean`) %>%
mutate(rank_temp = rank(-temp_mean_aft),
rank_canopy = rank(-avg_canopy_2015)) %>%
filter(between(rank_temp, 1L, 5L) |
between(rank_temp, 51L, 55L)) %>%
arrange(rank_temp)
All relevant data:
nsa_tree_temp %>%
select(nsa_name,
avg_canopy_2007 = `07_lid_mean`,
avg_canopy_2015 = `15_lid_mean`,
lid_change_percent,
lid_change_percent_point) %>%
mutate(is_target_nsa = case_when(
nsa_name %in% target_nsas ~ T,
TRUE ~ F
)) %>%
mutate(rank_canopy = rank(-avg_canopy_2015))
# filter(between(rank_canopy, 1L, 5L) |
# between(rank_canopy, 51L, 55L)) %>%
# arrange(rank_canopy)
What were the citywide gains and losses?
#### INCOMPLETE: WILL USE QGIS TO DO THIS CONFIRMATION ####
How many geographies gained and lost tree cover?
#### NSAs ####
as.data.frame(table(sign(nsa_tree_temp$lid_change_percent))) %>%
mutate(total = sum(Freq),
perc = round(100*(Freq/total), 2))
#### Blocks ####
as.data.frame(table(sign(blocks_tree_temp_demographics$lid_change_percent))) %>%
mutate(total = sum(Freq),
perc = round(100*(Freq/total), 2))
By how many percentage points did each NSA of interest gain or lose?
nsa_tree_temp %>%
select(nsa_name, `07_lid_mean`, `15_lid_mean`, lid_change_percent_point) %>%
filter((nsa_name %in% target_nsas) | (nsa_name %like% "%roland%")) %>%
mutate(lid_change_percent_point = round(lid_change_percent_point*100, 2),
`07_lid_mean` = round(`07_lid_mean`*100, 2),
`15_lid_mean` = round(`15_lid_mean`*100, 2))
redlining_tree %>%
group_by(holc_grade, grade_descr) %>%
summarise(total_area_pixels = sum(`count_all_pix_15`),
total_canopy_pixels = sum(`sum_canopy_pix_15`)) %>%
mutate(avg_canopy_perc = round(100*(total_canopy_pixels/total_area_pixels), 2))
The following data is originally sourced from the Baltimore City Department of Recreation and Parks, as explained in the Code Red Data Cleaning document.
# Colorblind-friendly palette
cbPalette <- c("#999999", # Dark Gray
"#E69F00", # Mustard Yellow
"#56B4E9", # Sky Blue
"#009E73", # Strong Green
"#F0E442", # Lemon Yellow
"#0072B2", # Denim Blue
"#D55E00", # Rust Orange
"#CC79A7") # Lavender
############################################
### Tree HEIGHT by neighborhood ############
############################################
# Plot HEIGHT by nsa for TARGET nsas using controled averages
street_trees_nsa_summarized %>%
filter((nbrdesc %in% target_nsas) | (nbrdesc %in% counterpoint_nsas)) %>%
ggplot(aes(x = reorder(nbrdesc, avg_ht_controled),
y = avg_ht_controled,
fill = factor(is_target_nsa,
# Rename fill levels in legend
labels=c("Counterpoint NSA"," Target NSA"))
)) +
geom_col() +
coord_flip() +
labs(title = "Average Tree Height of Target NSAs",
x = "",
y = "",
fill = "") +
scale_fill_manual(values=cbPalette) +
theme(legend.position = "bottom")
# Plot HEIGHT by nsa for ALL nsas using controled averages
ggplot(filter(street_trees_nsa_summarized, !is.na(avg_ht_controled)),
aes(x = reorder(nbrdesc, avg_ht_controled),
y = avg_ht_controled,
fill = factor(is_target_nsa,
# Rename fill levels in legend
labels=c("Not Target NSA", "Target NSA"))
)) +
geom_col() +
labs(title = "Average Tree Height of NSAs",
x = "",
y = "",
fill = "") +
scale_fill_manual(values=cbPalette) +
theme(legend.position = "top",
axis.text.x = element_text(angle = 90, hjust = 1, size = 5))
It is clear that taller trees are more common in wealthier NSAs compared to poorer ones such as Broadway East.
street_trees_nsa_categorized %>%
select(nbrdesc, tree_ht) %>%
filter((nbrdesc %in% target_nsas) | (nbrdesc %in% counterpoint_nsas)) %>%
group_by(nbrdesc) %>%
summarize(avg_ht = mean(tree_ht),
total_trees = n()) %>%
left_join(
street_trees_nsa_categorized %>%
select(nbrdesc, tree_ht) %>%
filter((nbrdesc %in% target_nsas) | (nbrdesc %in% counterpoint_nsas)) %>%
group_by(nbrdesc) %>%
filter(tree_ht >= 35) %>%
summarize(count_35_or_taller = n())
) %>%
mutate(perc_35_or_taller = round(100*(count_35_or_taller/total_trees), 2))
In addition to tracking each tree, the City of Baltimore also tracks spaces where trees could potentiall by planted. This, along with methodology for determining ease-of-planting classifications, is further explained in the Code Red Data Cleaning document.
#### Relevant info from table ####
street_trees_nsa_summarized %>%
select(1:33) %>%
filter((nbrdesc %in% target_nsas) | (nbrdesc %in% counterpoint_nsas))
Not only are there more trees in Roland Park than Broadway East, but the untreed spaces are considerably easier to plant.
#### Summarize planting difficulty in Broadway East and Roland Park ####
street_trees_nsa_categorized %>%
select(nbrdesc, difficulty_level_num, difficulty_level) %>%
filter((nbrdesc %like% "broadway%") | (nbrdesc %like% "roland%")) %>%
group_by(nbrdesc, difficulty_level_num, difficulty_level) %>%
summarize(count_spaces = n())
Broadway East has 1140 spaces that are available for planting, 1025 of which are not considered “unsuitable.” 96 percent of these are hard to plant (require breaking concrete).
#### Find percent of suitable spaces at each difficulty level for... ####
# ...Broadway East
street_trees_nsa_categorized %>%
# Count the spaces at each level
select(nbrdesc, difficulty_level_num, difficulty_level) %>%
filter((nbrdesc %like% "broadway%") & (difficulty_level != "has live tree")) %>%
group_by(nbrdesc, difficulty_level_num, difficulty_level) %>%
summarize(count_spaces = n()) %>%
ungroup() %>%
mutate(total_vacant_spaces = sum(count_spaces)) %>%
# Count the suitable spaces in Broadway East
left_join(
street_trees_nsa_categorized %>%
select(nbrdesc, difficulty_level_num, difficulty_level) %>%
filter((nbrdesc %like% "broadway%") & (difficulty_level != "has live tree") & (difficulty_level != "unsuitable")) %>%
group_by(nbrdesc, difficulty_level_num, difficulty_level) %>%
summarize(count_suitable = n()) %>%
ungroup() %>%
mutate(total_suitable = sum(count_suitable)) %>%
select(-count_suitable)
) %>%
# Calculate the percent of suitable spaces at each difficulty
mutate(perc_at_difficulty = round(100*(count_spaces/total_suitable), 2))
# ...Roland Park
street_trees_nsa_categorized %>%
# Count the spaces at each level
select(nbrdesc, difficulty_level_num, difficulty_level) %>%
filter((nbrdesc %like% "roland%") & (difficulty_level != "has live tree")) %>%
group_by(nbrdesc, difficulty_level_num, difficulty_level) %>%
summarize(count_spaces = n()) %>%
ungroup() %>%
mutate(total_vacant_spaces = sum(count_spaces)) %>%
# Count the suitable spaces in Roland Park
left_join(
street_trees_nsa_categorized %>%
select(nbrdesc, difficulty_level_num, difficulty_level) %>%
filter((nbrdesc %like% "roland%") & (difficulty_level != "has live tree") & (difficulty_level != "unsuitable")) %>%
group_by(nbrdesc, difficulty_level_num, difficulty_level) %>%
summarize(count_suitable = n()) %>%
ungroup() %>%
mutate(total_suitable = sum(count_suitable)) %>%
select(-count_suitable)
) %>%
# Calculate the percent of suitable spaces at each difficulty
mutate(perc_at_difficulty = round(100*(count_spaces/total_suitable), 2))
Larger (and therefore older) trees in the south-eastern neighborhoods tend to be less healthy than those in the northwest.
#### Relevant info from table ####
street_trees_nsa_categorized %>%
select(nbrdesc, diameter = dbh, condition, is_target_nsa)
When looking at larger trees, Roland Park is ranked 28th in for percent in good condition. Broadway East, by contrast, is ranked 227th of 277 NSAs.
#### Find counts of each tree condition in NSAs of interest, and calculate the percentage that are in good condition ####
street_trees_nsa_categorized %>%
select(nbrdesc, diameter = dbh, condition) %>%
# Filter for only trees with diameter larger than 6 inches
filter(diameter > 6) %>%
group_by(nbrdesc, condition) %>%
summarize(count_at_condition = n()) %>%
spread(condition, count_at_condition) %>%
# Drop non-live trees
select(-absent, -unknown, -dead, -stump) %>%
# Arrange in sensible order
select(poor, fair, good) %>%
ungroup() %>%
# Find percent
mutate(total_live_trees = rowSums(.[2:4]),
perc_good = round(100*(good/total_live_trees), 2)) %>%
arrange(desc(perc_good)) %>%
# Rank by percent
mutate(rank_by_perc_good = rank(-perc_good)) %>%
# Filter for NSAs of interest
filter((nbrdesc %in% target_nsas) | (nbrdesc %in% counterpoint_nsas))